1. Introduction

This notebook explores the customer data collected by ABC Bank, with the goal of predicting customer churn.

2. Imports

In [1]:
#Importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling  #adds the DataFrame.profile_report() method used at the end
In [3]:
#importing data
churn_data=pd.read_csv("../raw data/Churn_data.csv")
In [4]:
#Preview of data
churn_data.head()
Out[4]:
   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10       0
In [5]:
#Reviewing data types
churn_data.dtypes
Out[5]:
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

The data has 13 independent features of various data types (int, float, object), and the Exited column is the dependent feature of interest.

HasCrCard and IsActiveMember are stored as int64, but they are binary categorical features, so we should convert them to the category dtype.

The dependent feature is also stored as int64, but we may want to convert it to a categorical type as well, so that this is treated as a classification problem.

The customer name feature (Surname) is unlikely to be of much use and could be discarded later.
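
Discarding such columns could look like the sketch below. It rebuilds a few columns of the five preview rows so that it runs on its own. Treating RowNumber and CustomerId as identifiers to drop alongside Surname is an assumption made here, not something the notebook has decided yet.

```python
import pandas as pd

# A few columns of the five preview rows, retyped so the sketch is self-contained.
preview = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5],
    "CustomerId": [15634602, 15647311, 15619304, 15701354, 15737888],
    "Surname": ["Hargrave", "Hill", "Onio", "Boni", "Mitchell"],
    "CreditScore": [619, 608, 502, 699, 850],
    "Exited": [1, 0, 1, 0, 0],
})

# Assumption: these columns only identify a customer and carry no predictive signal.
id_columns = ["RowNumber", "CustomerId", "Surname"]

# drop() returns a new frame, so the original data stays available for reference.
features = preview.drop(columns=id_columns)
print(features.columns.tolist())
```

Keeping the dropped names in a list makes it easy to revisit the decision later, for example if CustomerId is needed to join predictions back to customers.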

In [8]:
#Converting categorical columns to the category dtype
churn_data["Geography"]=churn_data["Geography"].astype("category")
churn_data["Gender"]=churn_data["Gender"].astype("category")
churn_data["HasCrCard"]=churn_data["HasCrCard"].astype("category")
churn_data["IsActiveMember"]=churn_data["IsActiveMember"].astype("category")
churn_data["Exited"]=churn_data["Exited"].astype("category")
churn_data.dtypes
Out[8]:
RowNumber             int64
CustomerId            int64
Surname              object
CreditScore           int64
Geography          category
Gender             category
Age                   int64
Tenure                int64
Balance             float64
NumOfProducts         int64
HasCrCard          category
IsActiveMember     category
EstimatedSalary     float64
Exited             category
dtype: object
In [9]:
#Reviewing data info
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   RowNumber        10000 non-null  int64   
 1   CustomerId       10000 non-null  int64   
 2   Surname          10000 non-null  object  
 3   CreditScore      10000 non-null  int64   
 4   Geography        10000 non-null  category
 5   Gender           10000 non-null  category
 6   Age              10000 non-null  int64   
 7   Tenure           10000 non-null  int64   
 8   Balance          10000 non-null  float64 
 9   NumOfProducts    10000 non-null  int64   
 10  HasCrCard        10000 non-null  category
 11  IsActiveMember   10000 non-null  category
 12  EstimatedSalary  10000 non-null  float64 
 13  Exited           10000 non-null  category
dtypes: category(5), float64(2), int64(6), object(1)
memory usage: 752.6+ KB

The dataset contains 10,000 rows and no missing values, so we can continue with our data exploration.
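
The same conclusion can be checked explicitly instead of being read off the non-null counts. The sketch below uses a hypothetical toy frame with one deliberately missing value, only to show what the check reports when a gap exists. On the real data the equivalent call would be `churn_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the bank data) with one gap in the Age column.
toy = pd.DataFrame({
    "Age": [42, 41, np.nan],
    "Balance": [0.0, 83807.86, 159660.80],
})

# isna() marks each missing cell; summing per column counts the gaps.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```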

In [10]:
#Reviewing data range for numerical variables
churn_data.describe()
Out[10]:
         RowNumber    CustomerId   CreditScore           Age        Tenure        Balance  NumOfProducts  EstimatedSalary
count  10000.00000  1.000000e+04  10000.000000  10000.000000  10000.000000   10000.000000   10000.000000     10000.000000
mean    5000.50000  1.569094e+07    650.528800     38.921800      5.012800   76485.889288       1.530200    100090.239881
std     2886.89568  7.193619e+04     96.653299     10.487806      2.892174   62397.405202       0.581654     57510.492818
min        1.00000  1.556570e+07    350.000000     18.000000      0.000000       0.000000       1.000000        11.580000
25%     2500.75000  1.562853e+07    584.000000     32.000000      3.000000       0.000000       1.000000     51002.110000
50%     5000.50000  1.569074e+07    652.000000     37.000000      5.000000   97198.540000       1.000000    100193.915000
75%     7500.25000  1.575323e+07    718.000000     44.000000      7.000000  127644.240000       2.000000    149388.247500
max    10000.00000  1.581569e+07    850.000000     92.000000     10.000000  250898.090000       4.000000    199992.480000
In [15]:
#Reviewing categorical variables
churn_data.describe(include=['category'])
Out[15]:
        Geography  Gender  HasCrCard  IsActiveMember  Exited
count       10000   10000      10000           10000   10000
unique          3       2          2               2       2
top        France    Male          1               1       0
freq         5014    5457       7055            5151    7963
In [21]:
#Get categories of independent variables
print(churn_data.Geography.value_counts())
print(churn_data.Gender.value_counts())
print(churn_data.HasCrCard.value_counts())
print(churn_data.IsActiveMember.value_counts())
France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64
Male      5457
Female    4543
Name: Gender, dtype: int64
1    7055
0    2945
Name: HasCrCard, dtype: int64
1    5151
0    4849
Name: IsActiveMember, dtype: int64
In [22]:
#Get categories of dependent variable
print(churn_data.Exited.value_counts())
0    7963
1    2037
Name: Exited, dtype: int64

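Insights:

The customers are spread across 3 countries (France, Germany and Spain); France accounts for about half of them, and the rest are split almost evenly between Germany and Spain
The gender split is roughly balanced (about 55% male and 45% female)
About 70% of customers have a credit card
About 52% of customers are active members

About 20% of customers have churned, so the two target classes are imbalanced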
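
The percentages behind these insights can be derived from the counts printed above. The sketch below retypes those counts so it runs on its own. On the loaded data, `churn_data["Exited"].value_counts(normalize=True)` would give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = {
    "Geography": pd.Series({"France": 5014, "Germany": 2509, "Spain": 2477}),
    "Gender": pd.Series({"Male": 5457, "Female": 4543}),
    "HasCrCard": pd.Series({1: 7055, 0: 2945}),
    "IsActiveMember": pd.Series({1: 5151, 0: 4849}),
    "Exited": pd.Series({0: 7963, 1: 2037}),
}

# Convert each set of counts into percentages of the column total.
shares = {name: (s / s.sum() * 100).round(2) for name, s in counts.items()}

for name, pct in shares.items():
    print(name)
    print(pct.to_string())
```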

In [23]:
#Generating a profiling report for the full dataset
profile_report = churn_data.profile_report(explorative=False, html={'style': {'full_width': True}})
profile_report